Importing Packages

library(readr)
library(dplyr)
library(tidyverse)
library(ROCR)
library(ggplot2)
library(ggridges)
library(plotly)
library(ggbreak)
library(maps)
library(mapdata)
library(ggmap)
library(gapminder)
library(kableExtra)
library(dendextend)
library(tree)
library(maptree)
library(glmnet)
library(randomForest)
library(gbm)
library(neuralnet)

Data Presentation

Election Dataset
state     county      candidate      party  total_votes
Delaware  Kent        Joe Biden      DEM          44552
Delaware  Kent        Donald Trump   REP          41009
Delaware  Kent        Jo Jorgensen   LIB           1044
Delaware  Kent        Howie Hawkins  GRN            420
Delaware  New Castle  Joe Biden      DEM         195034
Census Dataset
CountyId State County TotalPop Men Women Hispanic White Black Native Asian Pacific VotingAgeCitizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
1001 Alabama Autauga County 55036 26899 28137 2.7 75.4 18.9 0.3 0.9 0 41016 55317 2838 27824 2024 13.7 20.1 35.3 18.0 23.2 8.1 15.4 86.0 9.6 0.1 0.6 1.3 2.5 25.8 24112 74.1 20.2 5.6 0.1 5.2
1003 Alabama Baldwin County 203360 99527 103833 4.4 83.1 9.5 0.8 0.7 0 155376 52562 1348 29364 735 11.8 16.1 35.7 18.2 25.6 9.7 10.8 84.7 7.6 0.1 0.8 1.1 5.6 27.0 89527 80.7 12.9 6.3 0.1 5.5
1005 Alabama Barbour County 26201 13976 12225 4.2 45.7 47.8 0.2 0.6 0 20269 33368 2551 17561 798 27.2 44.9 25.0 16.8 22.6 11.5 24.1 83.4 11.1 0.3 2.2 1.7 1.3 23.4 8878 74.1 19.1 6.5 0.3 12.4
1007 Alabama Bibb County 22580 12251 10329 2.4 74.6 22.0 0.4 0.0 0 17662 43404 3431 20911 1889 15.2 26.6 24.4 17.6 19.7 15.9 22.4 86.4 9.5 0.7 0.3 1.7 1.5 30.0 8171 76.0 17.4 6.3 0.3 8.2
1009 Alabama Blount County 57667 28490 29177 9.0 87.4 1.5 0.3 0.1 0 42513 47412 2630 22021 850 15.6 25.4 28.5 12.9 23.3 15.8 19.5 86.8 10.2 0.1 0.4 0.4 2.1 35.0 21380 83.9 11.9 4.0 0.1 4.9

Election Data

  • The election data has 32177 rows and 5 columns (32177 x 5).
  • There are no missing values in the election data set.
  • The state column has 51 unique values (presumably the 50 states plus the District of Columbia).

Census Data:

  • The census data has 3220 rows and 37 columns (3220 x 37).
  • There are missing values in the census data set.
  • There are 1955 unique values in the County column of the census data set.
  • There are 2825 unique values in the county column of the election data set.

The election.raw data set contains more unique county names than the census data set (2825 versus 1955).
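The checks summarized above can be sketched in base R. This is only an illustration: the small data frames stand in for election.raw and census, the election rows are taken from the preview table, and the NA in the census stand-in is a made-up value used to show the missing-value check.

```r
# Toy stand-ins for election.raw and census (not the full data sets)
election.raw <- data.frame(
  state       = c("Delaware", "Delaware", "Delaware", "Maine"),
  county      = c("Kent", "Kent", "New Castle", "Abbot"),
  candidate   = c("Joe Biden", "Donald Trump", "Joe Biden", "Donald Trump"),
  party       = c("DEM", "REP", "DEM", "REP"),
  total_votes = c(44552, 41009, 195034, 288)
)
census <- data.frame(
  County   = c("Autauga County", "Baldwin County"),
  TotalPop = c(55036, NA)   # NA is hypothetical, to exercise anyNA()
)

dim(election.raw)                     # rows x columns
anyNA(election.raw)                   # TRUE if any value is missing
anyNA(census)
length(unique(election.raw$state))    # number of distinct states

# Compare the number of distinct county names in the two data sets
n.county.election <- length(unique(election.raw$county))
n.county.census   <- length(unique(census$County))
n.county.election > n.county.census
```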

Data Wrangling

Total votes for each candidate
candidate       TOTAL
Alyson Kennedy   6791
Bill Hammons     6647
Blake Huber       409
Brian Carroll   25256
Brock Pierce    49552

Data wrangling with candidates

## [1] "There are total 38 named presidential candidates in the 2020 election"
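A per-candidate total like the one above could be produced with `aggregate()`. The sketch below is an assumption about the approach, not the report's actual code; it uses the five Delaware rows from the election preview, and `votes` is a hypothetical name.

```r
# Five rows from the election preview table
votes <- data.frame(
  candidate   = c("Joe Biden", "Donald Trump", "Jo Jorgensen",
                  "Howie Hawkins", "Joe Biden"),
  total_votes = c(44552, 41009, 1044, 420, 195034)
)

# Sum the votes over all counties for each candidate
candidate.totals <- aggregate(total_votes ~ candidate, data = votes, FUN = sum)
names(candidate.totals)[2] <- "TOTAL"
candidate.totals

# Number of distinct candidate names
n.candidates <- length(unique(votes$candidate))
n.candidates
```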

State winner and County winner

County Winner
county         state           candidate     party  total_votes  total  pct
Abbeville      South Carolina  Donald Trump  REP           8215  12433  0.6607416
Abbot          Maine           Donald Trump  REP            288    417  0.6906475
Abington       Massachusetts   Joe Biden     DEM           5209   9660  0.5392340
Acadia Parish  Louisiana       Donald Trump  REP          22596  28425  0.7949340
Accomack       Virginia        Donald Trump  REP           9172  16962  0.5407381
State Winner
state       candidate
Alabama     Donald Trump
Alaska      Donald Trump
Arizona     Joe Biden
Arkansas    Donald Trump
California  Joe Biden
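One way the winner tables could be derived is sketched below in base R, assuming the `total` column is the county's vote total and `pct` is the candidate's share of it. The rows come from the election preview; the grouping key and helper names are hypothetical.

```r
election <- data.frame(
  state       = rep("Delaware", 5),
  county      = c("Kent", "Kent", "Kent", "Kent", "New Castle"),
  candidate   = c("Joe Biden", "Donald Trump", "Jo Jorgensen",
                  "Howie Hawkins", "Joe Biden"),
  party       = c("DEM", "REP", "LIB", "GRN", "DEM"),
  total_votes = c(44552, 41009, 1044, 420, 195034)
)

# One key per (state, county) pair, since county names repeat across states
key <- paste(election$state, election$county, sep = " | ")

# Votes cast in the county, repeated on every row of that county
election$total <- ave(election$total_votes, key, FUN = sum)
election$pct   <- election$total_votes / election$total

# County winner: the row with the largest share within each county
county.winner <- election[election$pct == ave(election$pct, key, FUN = max), ]

# State winner: sum votes by state and candidate, then keep the largest total
state.votes  <- aggregate(total_votes ~ state + candidate, data = election, FUN = sum)
top          <- state.votes$total_votes ==
  ave(state.votes$total_votes, state.votes$state, FUN = max)
state.winner <- state.votes[top, c("state", "candidate")]

county.winner
state.winner
```

Only one New Castle row appears in the preview, so its share is trivially 1 in this toy example.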

Visualization

Data manipulation for census data set

Census.clean Dataset
CountyId State County TotalPop Men Women White VotingAgeCitizen Income Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority
1001 Alabama Autauga County 55036 48.87528163% 28137 75.4 74.52576495% 55317 13.7 20.1 35.3 18.0 23.2 15.4 86.0 9.6 0.1 1.3 2.5 25.8 43.81132350% 74.1 5.6 0.1 5.2 Pacific
1003 Alabama Baldwin County 203360 48.94128639% 103833 83.1 76.40440598% 52562 11.8 16.1 35.7 18.2 25.6 10.8 84.7 7.6 0.1 1.1 5.6 27.0 44.02389851% 80.7 6.3 0.1 5.5 Pacific
1005 Alabama Barbour County 26201 53.34147552% 12225 45.7 77.35964276% 33368 27.2 44.9 25.0 16.8 22.6 24.1 83.4 11.1 0.3 1.7 1.3 23.4 33.88420289% 74.1 6.5 0.3 12.4 Pacific
1007 Alabama Bibb County 22580 54.25597874% 10329 74.6 78.21966342% 43404 15.2 26.6 24.4 17.6 19.7 22.4 86.4 9.5 0.7 1.7 1.5 30.0 36.18689105% 76.0 6.3 0.3 8.2 Asian
1009 Alabama Blount County 57667 49.40433870% 29177 87.4 73.72153918% 47412 15.6 25.4 28.5 12.9 23.3 19.5 86.8 10.2 0.1 0.4 2.1 35.0 37.07493020% 83.9 4.0 0.1 4.9 Pacific

Dimensionality Reduction

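I centered and scaled the features before running PCA because the variables are measured on very different scales (counts, dollars, and percentages), and without scaling the columns with the largest variance would dominate the components. I removed ‘Minority’, which is a character column, and converted ‘Men’, ‘VotingAgeCitizen’, and ‘Employed’ from percentage strings to numeric values so that PCA can use them.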
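A minimal sketch of this preprocessing, assuming the percentage columns are stored as strings such as "48.87528163%". It uses only three numeric columns from the first five rows of the Census.clean preview; `census.toy` and `num` are hypothetical names.

```r
# First five rows of a few Census.clean columns
census.toy <- data.frame(
  Men      = c("48.87528163%", "48.94128639%", "53.34147552%",
               "54.25597874%", "49.40433870%"),
  Poverty  = c(13.7, 11.8, 27.2, 15.2, 15.6),
  Income   = c(55317, 52562, 33368, 43404, 47412),
  Minority = c("Pacific", "Pacific", "Pacific", "Asian", "Pacific")
)

# Strip the trailing "%" and convert the string to a number
census.toy$Men <- as.numeric(sub("%", "", census.toy$Men, fixed = TRUE))

# Keep numeric columns only; this drops the character column Minority
num <- census.toy[, sapply(census.toy, is.numeric)]

# Center each column to mean 0 and scale it to unit variance before PCA
pc.out <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pc.out)
```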

## ChildPoverty      Poverty     Employed 
##    0.3884449    0.3833092    0.3624495

The three features with the largest absolute loadings on the first principal component are ChildPoverty, Poverty, and Employed.

##      OtherTransp      PrivateWork VotingAgeCitizen 
##      0.001375716      0.049135190      0.050048362

The features reported as having opposite signs are OtherTransp, PrivateWork, and VotingAgeCitizen. The values printed above are all close to zero, so these three variables contribute very little to the first principal component. In other words, they are only weakly correlated with it.
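Ranking loadings by absolute value, and inspecting their signs, can be sketched as follows. The built-in USArrests data set is used purely as a stand-in for the census features, so the variable names differ from the report.

```r
# PCA on a built-in data set (stand-in for the census features)
pc.out <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings of each variable on the first principal component
pc1 <- pc.out$rotation[, 1]

# Variables with the largest absolute loadings on PC1
head(sort(abs(pc1), decreasing = TRUE), 3)

# Variables with the smallest absolute loadings on PC1
head(sort(abs(pc1)), 3)

# Signs of the loadings; variables with opposite signs move in opposite
# directions along PC1 (the overall sign of a component is arbitrary)
sign(pc1)
```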

Proportion of Variance Explained and Cumulative Proportion of Variance Explained

About 10 principal components are needed to capture 90% of the variance.
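The proportion of variance explained (PVE) and the number of components needed for a 90% threshold can be computed from the component standard deviations. The sketch uses the built-in USArrests data as a stand-in, so the resulting count is not the report's value of about 10.

```r
pc.out <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each component's variance divided by the total variance
pve     <- pc.out$sdev^2 / sum(pc.out$sdev^2)
cum.pve <- cumsum(pve)

# Smallest number of components whose cumulative PVE reaches 90%
n.pc <- which(cum.pve >= 0.90)[1]

round(pve, 4)
round(cum.pve, 4)
n.pc
```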

Clustering

Applying clustering method to the data set

## clus.10
##    1    2    3    4    5    6    7    8    9   10 
## 2131  128  892    6   14    1   11    7   25    4
## clus.10_pc_First_Twp
##    1    2    3    4    5    6    7    8    9   10 
## 3036    6    1  114   19    3   20    1   15    4

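The clustering on the first two principal components seems more appropriate for Santa Barbara County because it places the county in a cluster with several other California counties. Both solutions are very unbalanced, however: the largest cluster holds 2131 counties in the first clustering and 3036 in the second, and each solution has several clusters with fewer than ten counties.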
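The report does not show how the clusters were built. The sketch below assumes hierarchical clustering with complete linkage, once on all scaled features and once on the first two principal component scores. It uses the built-in USArrests data and k = 4 instead of 10 because that data set has only 50 rows.

```r
# Scaled features and the scores on the first two principal components
x         <- scale(USArrests)
pc.scores <- prcomp(USArrests, center = TRUE, scale. = TRUE)$x[, 1:2]

# Complete-linkage hierarchical clustering on Euclidean distances
hc.full <- hclust(dist(x), method = "complete")
hc.pc   <- hclust(dist(pc.scores), method = "complete")

# Cut each tree into k clusters
k         <- 4
clus.full <- cutree(hc.full, k = k)
clus.pc   <- cutree(hc.pc, k = k)

# Cluster sizes for the two solutions
table(clus.full)
table(clus.pc)

# Members of the cluster that contains one observation of interest
target <- "California"
names(clus.pc)[clus.pc == clus.pc[target]]
```

Listing the members of one observation's cluster is how a statement such as "it is grouped with other California counties" can be checked.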

Classification

We exclude the predictor ‘party’ from election.c1 for two reasons. First, party is a character column. Second, and more importantly, party describes the candidate rather than the county, so it would reveal the response directly. Our aim is to predict who won each county from census and population data, not from an attribute of the candidate.

Decision Tree Methods:

Records Dataset
          train.error  test.error
tree        0.0775087   0.0856354
logistic           NA          NA
lasso              NA          NA

The test error (0.0856) is higher than the training error (0.0775), which suggests the tree may be overfitting slightly and that a decision tree may not be the best algorithm for this problem.
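The error rates in the records table are misclassification rates. A minimal sketch of how such a table could be filled in follows; `calc.error.rate` and the toy label vectors are hypothetical, and the numbers are not the report's.

```r
# Share of predictions that differ from the true labels
calc.error.rate <- function(predicted, actual) mean(predicted != actual)

# Empty table with one row per method and one column per error type
records <- matrix(NA, nrow = 3, ncol = 2,
                  dimnames = list(c("tree", "logistic", "lasso"),
                                  c("train.error", "test.error")))

# Toy labels standing in for the true winners and the tree's predictions
actual.tr <- c("Donald Trump", "Donald Trump", "Joe Biden", "Donald Trump")
pred.tr   <- c("Donald Trump", "Joe Biden",    "Joe Biden", "Donald Trump")

records["tree", "train.error"] <- calc.error.rate(pred.tr, actual.tr)
records
```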

According to the plot, the decision tree first splits on Transit and then on White (the percentage of white residents). Further splits use SelfEmployed or Professional, followed by TotalPop or Production.

Logistic Regression Methods:

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = candidate ~ ., family = binomial, data = election.tr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7362  -0.2479  -0.0851  -0.0104   3.8603  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.598e+00  6.424e+00  -1.183 0.236872    
## TotalPop         -9.395e-06  4.041e-05  -0.232 0.816167    
## Men              -2.872e-04  3.162e-04  -0.908 0.363709    
## Women             2.197e-05  7.976e-05   0.275 0.783028    
## White            -1.348e-01  1.301e-02 -10.358  < 2e-16 ***
## VotingAgeCitizen  2.299e-03  3.702e-04   6.211 5.25e-10 ***
## Income           -7.479e-06  2.131e-05  -0.351 0.725670    
## Poverty          -1.532e-02  5.297e-02  -0.289 0.772355    
## ChildPoverty      2.152e-02  3.286e-02   0.655 0.512459    
## Professional      3.036e-01  5.104e-02   5.948 2.71e-09 ***
## Service           3.045e-01  6.085e-02   5.004 5.62e-07 ***
## Office            2.012e-01  6.347e-02   3.170 0.001524 ** 
## Production        2.005e-01  5.179e-02   3.871 0.000108 ***
## Drive            -2.011e-01  4.799e-02  -4.191 2.78e-05 ***
## Carpool          -2.118e-01  6.532e-02  -3.242 0.001185 ** 
## Transit           2.652e-01  1.211e-01   2.191 0.028485 *  
## OtherTransp       1.750e-02  1.225e-01   0.143 0.886382    
## WorkAtHome       -1.105e-01  7.368e-02  -1.500 0.133514    
## MeanCommute       6.144e-03  3.168e-02   0.194 0.846227    
## Employed          3.054e-03  5.052e-04   6.045 1.50e-09 ***
## PrivateWork       3.075e-02  2.647e-02   1.162 0.245339    
## SelfEmployed      9.804e-05  5.703e-02   0.002 0.998628    
## FamilyWork       -1.813e+00  6.984e-01  -2.596 0.009442 ** 
## Unemployment      1.971e-01  5.340e-02   3.692 0.000223 ***
## MinorityBlack     1.245e+00  1.340e+00   0.929 0.353022    
## MinorityHispanic -9.237e+00  6.357e+02  -0.015 0.988406    
## MinorityNative    2.375e+00  1.127e+00   2.108 0.035011 *  
## MinorityPacific   2.167e+00  1.122e+00   1.932 0.053386 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1497.31  on 1444  degrees of freedom
## Residual deviance:  498.71  on 1417  degrees of freedom
## AIC: 554.71
## 
## Number of Fisher Scoring iterations: 14
##      (Intercept)         TotalPop              Men            Women 
##        -7.598256        -0.000009        -0.000287         0.000022 
##            White VotingAgeCitizen           Income          Poverty 
##        -0.134788         0.002299        -0.000007        -0.015325 
##     ChildPoverty     Professional          Service           Office 
##         0.021523         0.303613         0.304483         0.201189 
##       Production            Drive          Carpool          Transit 
##         0.200494        -0.201114        -0.211797         0.265177 
##      OtherTransp       WorkAtHome      MeanCommute         Employed 
##         0.017500        -0.110546         0.006144         0.003054 
##      PrivateWork     SelfEmployed       FamilyWork     Unemployment 
##         0.030753         0.000098        -1.812870         0.197134 
##    MinorityBlack MinorityHispanic   MinorityNative  MinorityPacific 
##         1.244683        -9.237465         2.375326         2.167049
Records Dataset
          train.error  test.error
tree        0.0775087   0.0856354
logistic    0.0643599   0.0911602
lasso              NA          NA

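Several variables are significant at the 0.001 level, including White, VotingAgeCitizen, Professional, Service, Production, Drive, Employed, and Unemployment. This differs from the decision tree, whose first split is on Transit. The coefficient for Professional is about 0.30, so with the other predictors held fixed, a one-point increase in Professional multiplies the odds of the modeled outcome by about exp(0.30) ≈ 1.35. Raw coefficients are not directly comparable across predictors measured on different scales; judged by z value, White (z = -10.36) has the strongest association.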
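Fitting a logistic regression, turning fitted probabilities into class labels, and reading coefficients as odds ratios can be sketched as follows. The data are simulated, the column names are borrowed from the report for readability, and the 0.5 threshold is an assumption.

```r
set.seed(1)
n   <- 200
sim <- data.frame(White = runif(n, 20, 100), Transit = runif(n, 0, 10))

# Hypothetical data-generating model, used only to create labels
p <- plogis(4 - 0.08 * sim$White + 0.3 * sim$Transit)
sim$candidate <- factor(ifelse(runif(n) < p, "Joe Biden", "Donald Trump"))

train <- sim[1:150, ]
test  <- sim[151:200, ]

# With a factor response, glm models the probability of the second level
levels(sim$candidate)
fit <- glm(candidate ~ ., data = train, family = binomial)

# Predicted probabilities on the test set, thresholded at 0.5
prob.test  <- predict(fit, newdata = test, type = "response")
pred.test  <- ifelse(prob.test > 0.5, "Joe Biden", "Donald Trump")
test.error <- mean(pred.test != test$candidate)
test.error

# Odds ratios: multiplicative change in the odds per one-unit increase
exp(coef(fit))
```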

Lasso Regression:

## 25 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)      -4.747366
## (Intercept)       .       
## TotalPop          0.000001
## Men              -0.000367
## Women             0.000002
## White            -0.113226
## VotingAgeCitizen  0.002067
## Income            0.000004
## Poverty           0.026369
## ChildPoverty      0.008004
## Professional      0.220289
## Service           0.216087
## Office            0.140080
## Production        0.115647
## Drive            -0.143360
## Carpool          -0.149760
## Transit           0.235636
## OtherTransp       0.065751
## WorkAtHome       -0.049878
## MeanCommute      -0.001856
## Employed          0.002587
## PrivateWork       0.026405
## SelfEmployed     -0.048529
## FamilyWork       -1.351018
## Unemployment      0.171172

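The optimal value for \(\lambda\) is 0.0013. At this small penalty every predictor keeps a non-zero coefficient, including White, Transit, and Unemployment. Compared to unpenalized logistic regression, the lasso shrinks the larger coefficients toward zero: for example, White moves from -0.135 to -0.113 and Professional from 0.304 to 0.220.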
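Choosing \(\lambda\) by cross-validation with the glmnet package can be sketched as follows. The data are simulated and the default \(\lambda\) grid is used, which may differ from the report's settings. The duplicated `(Intercept)` row in the output above typically appears when the design matrix from `model.matrix()` keeps its intercept column; dropping that column, as done here, avoids it.

```r
library(glmnet)

set.seed(1)
n   <- 200
sim <- data.frame(White   = runif(n, 20, 100),
                  Transit = runif(n, 0, 10),
                  Noise   = rnorm(n))          # irrelevant predictor

# Hypothetical data-generating model, used only to create labels
p <- plogis(4 - 0.08 * sim$White + 0.3 * sim$Transit)
sim$candidate <- factor(ifelse(runif(n) < p, "Joe Biden", "Donald Trump"))

# Design matrix without the intercept column (glmnet adds its own intercept)
x <- model.matrix(candidate ~ ., data = sim)[, -1]
y <- sim$candidate

# alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation
cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
cv.fit$lambda.min

# Coefficients at the selected lambda; a "." marks a coefficient set to zero
coef(cv.fit, s = "lambda.min")
```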

ROC Curves:

Random Forest Methods:

## [1] "The test error rate is 0.0828729281767956"

Boosting Methods:

##                               var    rel.inf
## Transit                   Transit 21.7580123
## White                       White 20.5745261
## TotalPop                 TotalPop  8.3418241
## Professional         Professional  7.1804795
## Women                       Women  5.2614706
## VotingAgeCitizen VotingAgeCitizen  4.9431347
## Employed                 Employed  4.4275587
## Unemployment         Unemployment  3.0682501
## Men                           Men  2.8770159
## SelfEmployed         SelfEmployed  2.8269815
## Service                   Service  2.5871555
## Production             Production  2.4250571
## Income                     Income  1.9888523
## ChildPoverty         ChildPoverty  1.9331697
## PrivateWork           PrivateWork  1.7037428
## MeanCommute           MeanCommute  1.3529113
## WorkAtHome             WorkAtHome  1.3049564
## Office                     Office  1.2172617
## OtherTransp           OtherTransp  1.1975903
## Drive                       Drive  1.1621989
## Poverty                   Poverty  1.0896626
## Carpool                   Carpool  0.6107129
## FamilyWork             FamilyWork  0.1674751

## [1] "The test error rate is 0.0856353591160221"

From the plot we can see how Employed and Unemployment relate to the prediction.

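Random forest gave a slightly lower test error (0.0829) than the single decision tree (0.0856), because averaging many trees reduces the variance, and hence the overfitting, of any single tree. Boosting (0.0856) performs about the same as the decision tree and close to logistic regression (0.0912); no error was recorded for the lasso. The most influential variables under boosting, Transit and White, match the first two splits of the decision tree, but they differ from Professional, the variable highlighted in the logistic regression.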
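A random forest test error and variable-importance table can be obtained along these lines with the randomForest package. The data are simulated, and `ntree = 500` is the package default rather than a setting confirmed by the report.

```r
library(randomForest)

set.seed(1)
n   <- 200
sim <- data.frame(White   = runif(n, 20, 100),
                  Transit = runif(n, 0, 10),
                  Noise   = rnorm(n))

# Hypothetical data-generating model, used only to create labels
p <- plogis(4 - 0.08 * sim$White + 0.3 * sim$Transit)
sim$candidate <- factor(ifelse(runif(n) < p, "Joe Biden", "Donald Trump"))

train <- sim[1:150, ]
test  <- sim[151:200, ]

# Each tree is grown on a bootstrap sample with a random subset of predictors
rf <- randomForest(candidate ~ ., data = train, ntree = 500, importance = TRUE)

# Majority vote over the trees gives the predicted class
pred.rf    <- predict(rf, newdata = test)
test.error <- mean(pred.rf != test$candidate)
test.error

# Variable importance, comparable in spirit to the boosting rel.inf table
importance(rf)
```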

Finally, we compare linear regression (less flexible) with neural network classification (more flexible) on the election data.

Linear Regression Methods:

## 
## Call:
## lm(formula = total_votes ~ TotalPop + Men + Women + White + VotingAgeCitizen + 
##     Income + Poverty + ChildPoverty + Professional + Service + 
##     Office + Production + Drive + Carpool + Transit + OtherTransp + 
##     WorkAtHome + MeanCommute + Employed + PrivateWork + SelfEmployed + 
##     FamilyWork + Unemployment, data = lm.data.tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340263  -41276  -28222    -768 1661763 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.387e+05  1.428e+05   2.372 0.017804 *  
## TotalPop         -4.799e-01  6.538e-01  -0.734 0.463119    
## Men              -2.049e+00  3.423e+00  -0.599 0.549485    
## Women             1.505e+00  1.288e+00   1.169 0.242798    
## White             2.827e+02  1.994e+02   1.418 0.156387    
## VotingAgeCitizen  4.311e+00  3.899e+00   1.106 0.269070    
## Income           -3.751e-01  5.096e-01  -0.736 0.461850    
## Poverty          -1.610e+03  1.317e+03  -1.222 0.221799    
## ChildPoverty      7.503e+02  7.397e+02   1.014 0.310601    
## Professional     -7.225e+02  9.109e+02  -0.793 0.427792    
## Service          -1.106e+03  1.096e+03  -1.009 0.313191    
## Office           -3.913e+03  1.178e+03  -3.321 0.000920 ***
## Production       -1.990e+02  9.266e+02  -0.215 0.829946    
## Drive            -8.062e+02  1.238e+03  -0.651 0.515144    
## Carpool          -1.778e+03  1.569e+03  -1.133 0.257220    
## Transit          -2.581e+03  1.819e+03  -1.419 0.156103    
## OtherTransp      -1.934e+03  2.730e+03  -0.708 0.478790    
## WorkAtHome       -5.575e+01  1.847e+03  -0.030 0.975928    
## MeanCommute       1.912e+02  6.075e+02   0.315 0.752962    
## Employed          3.363e+00  5.343e+00   0.629 0.529174    
## PrivateWork      -7.877e+02  6.055e+02  -1.301 0.193534    
## SelfEmployed     -3.940e+03  1.170e+03  -3.369 0.000776 ***
## FamilyWork       -5.037e+03  6.330e+03  -0.796 0.426321    
## Unemployment      5.610e+02  1.353e+03   0.415 0.678405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97860 on 1421 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.5521 
## F-statistic: 78.39 on 23 and 1421 DF,  p-value: < 2.2e-16
## [1] "The MSE is 13473109137.6167"
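The MSE is the mean squared difference between observed and predicted values on held-out data. Its square root is in the units of the response; for the value reported above that is roughly 116,000 votes. The sketch below uses simulated data with a single predictor, so its numbers are unrelated to the report's.

```r
set.seed(1)
n <- 100
d <- data.frame(TotalPop = runif(n, 1e3, 1e5))

# Hypothetical relationship between population and votes
d$total_votes <- 0.4 * d$TotalPop + rnorm(n, sd = 2000)

train <- d[1:70, ]
test  <- d[71:100, ]

fit <- lm(total_votes ~ TotalPop, data = train)

# Mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
mse  <- mean((test$total_votes - pred)^2)
mse

# Root MSE is on the same scale as total_votes
sqrt(mse)
```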

Neural Network Methods:

Neural Network Prediction
    actual        Donald Trump  Joe Biden  Prediction
6   Donald Trump     0.8314220  0.1686340  Donald Trump
11  Joe Biden        0.6474821  0.3525925  Donald Trump
14  Joe Biden        0.1000002  0.9000184  Joe Biden
16  Joe Biden        0.2523672  0.7477261  Joe Biden
27  Donald Trump     0.8314220  0.1686340  Donald Trump
## [1] "The test error rate is 0.213035606517803"
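The Prediction column appears to be the class with the larger network output. The sketch below reproduces that rule on the five rows shown in the table; the column names `trump` and `biden` are hypothetical, and the error rate it computes applies to these five rows only, not to the full test set.

```r
# The five rows shown in the prediction table
nn <- data.frame(
  actual = c("Donald Trump", "Joe Biden", "Joe Biden",
             "Joe Biden", "Donald Trump"),
  trump  = c(0.8314220, 0.6474821, 0.1000002, 0.2523672, 0.8314220),
  biden  = c(0.1686340, 0.3525925, 0.9000184, 0.7477261, 0.1686340)
)

# Pick the class whose output is largest in each row
classes <- c("Donald Trump", "Joe Biden")
scores  <- as.matrix(nn[, c("trump", "biden")])
nn$Prediction <- classes[max.col(scores, ties.method = "first")]

# Share of rows where the prediction differs from the actual winner
err <- mean(nn$Prediction != nn$actual)
nn
err
```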

Conclusion:

Final Records
train.error test.error
                train.error  test.error
tree              0.0775087   0.0856354
logistic          0.0643599   0.0911602
lasso                    NA          NA
random forest            NA   0.0828729
boosting                 NA   0.0856354
neural network           NA   0.2130356